The paper "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention"

An attention-based model for image caption generation.

Implementation notes for an extension of the paper.

Observation from the caption data: most captions are 10 to 11 words long.

Pre-Processing text(captions):

  1. Tokenize: remove punctuation, convert to lowercase, keep only the top 5,000 most frequent words to save memory, and map all remaining words to "UNK" (unknown)
  2. Create index to word and word to index mappings
  3. Pad all sequences to the same length
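
The three steps above can be sketched in plain Python (names such as `build_vocab` and the `<pad>`/`<unk>` tokens are illustrative, not taken from the original code):

```python
import string
from collections import Counter

def build_vocab(captions, top_k=5000):
    """Tokenize captions, keep the top_k most frequent words, map the rest to UNK."""
    counts = Counter()
    table = str.maketrans("", "", string.punctuation)
    for cap in captions:
        counts.update(cap.translate(table).lower().split())
    # Reserve index 0 for padding and 1 for unknown words.
    word_to_index = {"<pad>": 0, "<unk>": 1}
    for w, _ in counts.most_common(top_k):
        word_to_index[w] = len(word_to_index)
    index_to_word = {i: w for w, i in word_to_index.items()}
    return word_to_index, index_to_word

def encode_and_pad(captions, word_to_index, max_len):
    """Map words to indices (UNK for out-of-vocab) and right-pad with zeros."""
    table = str.maketrans("", "", string.punctuation)
    seqs = []
    for cap in captions:
        tokens = cap.translate(table).lower().split()
        ids = [word_to_index.get(t, word_to_index["<unk>"]) for t in tokens]
        seqs.append((ids + [0] * max_len)[:max_len])
    return seqs
```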

Pre-Processing the images:

  1. Resize each image to (299, 299)
  2. Normalize pixel values to [-1, 1], the input format expected by InceptionV3 (a CNN by Google, 48 layers deep)
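
A minimal sketch of the normalization step, assuming the image has already been resized to 299x299 and is stored as a uint8 array:

```python
import numpy as np

def preprocess_image(img_uint8):
    """Scale uint8 pixels in [0, 255] to the [-1, 1] range InceptionV3 expects."""
    return img_uint8.astype(np.float32) / 127.5 - 1.0
```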

The image data format defaults to 'channels_last', meaning each image is represented as a three-dimensional array whose last axis holds the color channels, e.g. [rows][cols][channels].

Creating test and train data

  1. Combine images and captions (tf.data.Dataset API)
  2. train_test_split with an 80:20 ratio, random_state = 42
  3. Shuffle and batch
  4. Shape of each image: (batch_size, 299, 299, 3)
  5. Shape of each caption: (batch_size, max_len)
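
A rough, dependency-free sketch of the split-and-batch steps (the actual pipeline uses sklearn's train_test_split and the tf.data.Dataset API; the helpers below only illustrate the idea):

```python
import random

def train_test_split(items, test_ratio=0.2, seed=42):
    """Shuffle deterministically, then split into train and test (80:20 by default)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_ratio)
    return items[n_test:], items[:n_test]

def batch(items, batch_size):
    """Group items into fixed-size batches, dropping the ragged remainder."""
    usable = len(items) - len(items) % batch_size
    return [items[i:i + batch_size] for i in range(0, usable, batch_size)]
```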

Step 1: CNN

  1. Load pretrained ImageNet weights for InceptionV3 - saves RAM
  2. Use the last convolutional layer of the pretrained model (InceptionV3), whose output shape is (8, 8, 2048)
  3. Extract features for the train and test datasets, of shape (batch_size, 8*8, 2048)
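
Flattening the 8x8 spatial grid into 64 attention locations is just a reshape; a quick check of the shapes (batch size of 2 chosen arbitrarily):

```python
import numpy as np

# InceptionV3's last convolutional layer yields an (8, 8, 2048) feature map per
# image; flattening the spatial grid gives 64 locations the attention can weight.
batch_size = 2
features = np.zeros((batch_size, 8, 8, 2048), dtype=np.float32)
flat = features.reshape(batch_size, 8 * 8, 2048)
```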

Model Building:

  1. Set parameters
  2. Build the encoder, attention model & decoder

Encoder
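
A minimal sketch of what the encoder typically does in this architecture: a fully connected layer with ReLU that maps each of the 64 feature vectors (2048-d) to a smaller embedding. The dimension embedding_dim = 256 is an assumption, and the weight matrix would be learned rather than zero:

```python
import numpy as np

embedding_dim = 256
W = np.zeros((2048, embedding_dim), dtype=np.float32)  # learned in practice

def encode(features):
    """features: (batch, 64, 2048) -> (batch, 64, embedding_dim) with ReLU."""
    return np.maximum(features @ W, 0.0)
```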

Attention model
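
A NumPy sketch of additive (Bahdanau) attention, the mechanism the paper builds on; all weight shapes here are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bahdanau_attention(features, hidden, W1, W2, V):
    """Additive attention over the 64 image locations.
    features: (batch, 64, d), hidden decoder state: (batch, units)."""
    # score shape: (batch, 64, 1)
    score = np.tanh(features @ W1 + hidden[:, None, :] @ W2) @ V
    weights = softmax(score, axis=1)             # sums to 1 over locations
    context = (weights * features).sum(axis=1)   # weighted sum -> (batch, d)
    return context, weights
```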

Decoder

Final output size: vocab_size, because we need to assign a probability to every word in the vocabulary
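
A sketch of that final projection, assuming a hypothetical decoder state of size 8 and vocab_size = 5000 (zero weights only to keep the example self-contained):

```python
import numpy as np

def output_layer(decoder_state, W_out, b_out):
    """Project the decoder state to vocab_size logits, then softmax so every
    word in the vocabulary gets a probability."""
    logits = decoder_state @ W_out + b_out
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```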

Model training & optimization:

  1. Set optimizer and loss object
  2. Create checkpoint path
  3. Create training and testing step functions
  4. Create loss function for test dataset
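
A sketch of the masked loss typically used for captioning, where padded positions (token id 0) are excluded so they contribute nothing to the loss (this mirrors, but is not copied from, the training code):

```python
import numpy as np

def masked_cross_entropy(probs, targets):
    """Mean cross-entropy over real tokens only.
    probs: (batch, seq, vocab) predicted probabilities; targets: (batch, seq) ids."""
    batch, seq = targets.shape
    # Pick out the probability each position assigned to its target word.
    token_probs = probs[np.arange(batch)[:, None], np.arange(seq), targets]
    mask = (targets != 0).astype(np.float32)     # 0 marks padding
    nll = -np.log(token_probs + 1e-9) * mask
    return nll.sum() / mask.sum()
```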

Model Evaluation:

  1. Define evaluation function using greedy search
  2. Test it on a sample data using BLEU score
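
The greedy search in step 1 can be sketched as the loop below; step_fn is a stand-in for one decoder step that returns logits over the vocabulary (any recurrent state is assumed to live inside it):

```python
import numpy as np

def greedy_decode(step_fn, start_id, end_id, max_len):
    """Greedy search: feed the argmax word back in until <end> or max_len."""
    word, out = start_id, []
    for _ in range(max_len):
        word = int(np.argmax(step_fn(word)))
        if word == end_id:
            break
        out.append(word)
    return out
```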